Gap-free genome assembly of Salangid icefish Neosalanx taihuensis

Neosalanx taihuensis is widely distributed in freshwater and brackish water areas in China. Due to its high commercial value, it has been artificially introduced into many lakes and reservoirs, where it shows strong ecological adaptability. Here, a gap-free chromosome-level reference genome was constructed by combining short reads, PacBio HiFi long reads, Nanopore ultralong reads and Hi-C data. The reference genome of N. taihuensis was 397.29 Mb with a contig N50 of 15.61 Mb. The assembled sequences were anchored to 28 chromosomes. Furthermore, 20,024 protein-coding genes, representing 98.16% of the predicted genes, were annotated in publicly available biological databases. This high-quality gap-free assembled genome will provide an essential reference for studying the evolution and ecological adaptability of N. taihuensis.


Background & Summary
Neosalanx taihuensis, a member of the family Salangidae in the order Osmeriformes, is an economically important aquaculture fish in China that has a transparent body and feeds on zooplankton [1-3]. N. taihuensis is endemic to fresh and brackish waters in China, where it is widely distributed, and has been artificially introduced to numerous lakes and reservoirs due to its high commercial value [4]. The natural population is distributed not only in estuarine areas from the Yellow and Bohai Seas to the coast of the South China Sea but also in the main streams of the Yangtze River, Huai River and Yellow River and their subsidiary lakes [5,6]. Among these sites, the Yangtze River basin and Taihu Lake are the core habitats of the natural population of N. taihuensis [7-9]. The natural N. taihuensis population has declined seriously due to overfishing and habitat destruction [10]. Fortunately, artificial translocation has greatly expanded the geographic distribution and population diversity of the species [7,11-13]. Translocation into waters such as Erhai Lake, Fuxian Lake, Dianchi Lake and the Three Gorges Reservoir has resulted in stable populations of N. taihuensis in these new habitats [8,11,14,15]. Differences in genetic diversity between translocated and natural populations have attracted research interest, and a variety of molecular markers, including COI, microsatellites and Cytb, have been developed [7,9,11,16,17]. Analyses of these markers have shown that the genetic diversity of the translocated populations of N. taihuensis is higher than that of the natural population and have provided preliminary insight into the molecular mechanism of N. taihuensis adaptation to the environment [7,9,11,16,17].
More specifically, translocated N. taihuensis also exhibit plasticity in their reproductive biology. In natural habitats such as the Yangtze River basin and Taihu Lake, N. taihuensis commonly has two breeding groups, a spring breeding group and an autumn breeding group [18-23], with the spring breeding group being the main source of population replenishment [21,24]. In contrast, the reproductive pattern of the translocated populations has changed. The translocated N. taihuensis population in Erhai has only one spawning period, from late autumn to early winter [22], whereas the translocated population in Dianchi has formed three reproductive groups, namely, the winter, autumn and spring groups [23]. The translocated population has a longer reproductive period than the natural population, showing a more obvious adaptation in reproductive strategy. Current research on the differentiation of reproductive populations of N. taihuensis is limited to the description and statistics of epigenetic phenomena, and few studies have investigated the underlying genetic mechanisms and molecular evolution.
Until now, molecular biology and genomic research on N. taihuensis has been rare due to the lack of a reference genome. The lack of genomic information greatly limits the study of N. taihuensis phylogeny and genetic differentiation. Likewise, it has not been possible to explore the adaptation and reproductive strategies of N. taihuensis at the genomic level.
In this study, we report a gap-free genome assembly for N. taihuensis that combines short reads, PacBio HiFi long reads, Nanopore ultralong reads and Hi-C data. The assembled N. taihuensis genome was approximately 397.29 Mb with a contig N50 of 15.61 Mb. Gene prediction yielded 20,400 protein-coding genes, of which 20,024 (98.16%) were annotated in publicly available biological databases, including NR, GO, KOG, KEGG, TrEMBL, InterPro and SwissProt. This high-quality, gap-free assembled genome will provide an important resource for studying the reproductive biology and ecological adaptability of N. taihuensis.

Methods
Ethics declarations.
This work was approved by the Bioethical Committee of the Freshwater Fisheries Research Center (FFRC) of the Chinese Academy of Fishery Sciences (CAFS) (FEH20200807, 2020/08/07). Sampling was performed in strict accordance with the Freshwater Fisheries Research Center Experimental Animal Ethics Guidelines.

Sample collection.
Muscle tissue samples were collected from adult N. taihuensis for this study (Fig. 1).
The collection site was located at Taihu Lake, Huzhou, Zhejiang Province (coordinates: E120°5′0.999996″, N31°0′59.999976″). Sampling was performed in strict accordance with relevant Chinese laws and experimental ethical guidelines. After collection, the muscle tissue samples were rapidly frozen in liquid nitrogen and stored at −80 °C until DNA extraction.
DNA and RNA extraction, library construction, sequencing, assembly and bioinformatics analyses in this study were performed using standard experimental and analytical protocols from BGI Genomics (Shenzhen, China).

RNA isolation, cDNA library construction and sequencing.
For gene structure annotation, RNA was isolated from the muscle tissue samples using the TRIzol Total RNA Isolation Kit (Takara, USA) following the manufacturer's protocols [25]. The RNA was then sheared and reverse transcribed using random primers to obtain cDNA, which was used for library construction. Library quality was assessed using a Bioanalyzer 2100. The libraries were then subjected to paired-end sequencing with a read length of 150 bp on the BGISEQ sequencing platform (BGI).

WGS library construction, sequencing and genome survey.
DNA was extracted from N. taihuensis muscle tissue using a hypervariable minisatellite probe (MZ 1.3) along with locus-specific minisatellite probes (g3, MS1, MS43). The DNA was fragmented to between 50 and 800 bp using a Covaris E220 ultrasonicator, following the manufacturer's guidelines, to create a short-insert whole-genome shotgun (WGS) library. A library with fragments between 300 and 400 bp was constructed and sequenced on the MGISEQ platform, generating 45.69 Gb of DNBSEQ short-insert data (Table 1). FastQC (v0.1) [26] was used to remove low-quality or adapter-contaminated reads. From the filtered data, the k-mer frequency distribution was determined using Jellyfish (v2.2.6) [27] and analyzed with GenomeScope (v1.0) [28]. The N. taihuensis genome was estimated to be approximately 356 Mb with a heterozygosity rate of 0.77% (Fig. 2 and Table 2).
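As a rough illustration of the k-mer-based genome survey described above, genome size can be approximated as the total number of k-mer occurrences divided by the depth of the main peak of the k-mer frequency histogram. The sketch below is a simplified peak-based estimate, not the mixture-model fit that GenomeScope performs. The function name, the `min_depth` error cutoff and the toy histogram are hypothetical, and in a heterozygous genome the tallest peak may be the heterozygous one, which this sketch does not handle.

```python
from typing import Dict


def estimate_genome_size(hist: Dict[int, int], min_depth: int = 5) -> float:
    """Peak-based genome size estimate from a k-mer histogram.

    hist maps k-mer depth (multiplicity) to the number of distinct k-mers
    observed at that depth, e.g. parsed from a two-column histogram file.
    min_depth is an assumed cutoff that discards low-depth k-mers, which
    mostly arise from sequencing errors.
    """
    kept = {depth: count for depth, count in hist.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # The depth with the most distinct k-mers is taken as the main coverage peak.
    peak_depth = max(kept, key=kept.get)
    # Total k-mer occurrences: sum over depths of depth * distinct count.
    total_kmers = sum(depth * count for depth, count in kept.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical numbers): an error tail at depths 1-2
# and a coverage peak centred on depth 40.
toy_hist = {1: 5_000_000, 2: 800_000, 39: 2_000_000, 40: 6_000_000, 41: 2_000_000}
size = estimate_genome_size(toy_hist)
print(f"Estimated genome size: {size / 1e6:.2f} Mb")  # prints 10.00 Mb
```

The error cutoff matters because low-depth k-mers are numerous but do not represent genomic sequence; including them would inflate the estimate.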
PacBio library construction, sequencing and de novo assembly.
DNA from N. taihuensis muscle tissue was extracted using a QIAGEN Blood & Cell Culture DNA Midi Kit (QIAGEN, Germany). A PacBio library with an insert size of approximately 20 kb was then prepared using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, USA) and sequenced on a PacBio Sequel II SMRT cell in CCS mode. After processing with the SMRT Link (v8.0.0) [29] CCS algorithm with the parameters "--minPasses 3 --minPredictedAccuracy 0.99 --minLength 500", which excluded adaptors and less accurate reads, 25.88 Gb of HiFi reads were obtained. The reads had an N50 length of 15.17 kb and an average length of 14.91 kb (Table 1). The initial de novo genome assembly was generated using Hifiasm (v0.15.1) [30] with default settings, and redundant sequences were then purged using Purge-Haplotigs [31] with the parameters "-j 80 -s 80 -a 75".
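The N50 statistic reported for the HiFi reads, and later for the contigs, can be computed from a list of sequence lengths. The following is a minimal sketch of the standard definition; the function name and the toy lengths are illustrative and are not taken from the pipeline used in this study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("empty length list")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # only reached in degenerate cases


# Toy example: total length 30, so half is 15.
# Cumulative sums from the longest sequence: 10, 16 -> N50 is 6.
toy_lengths = [2, 3, 4, 5, 6, 10]
print(n50(toy_lengths))  # prints 6
```

Because N50 is weighted by length, a few long sequences raise it far more than many short ones, which is why it is commonly used alongside the sequence count when describing contiguity.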

Hi-C library preparation, sequencing and chromosome anchoring.
A Hi-C library was created using the Mbo I restriction enzyme [32]. Muscle tissue samples were treated with 1% formaldehyde at room temperature for 10-30 minutes to crosslink chromatin-interacting proteins. After digestion with the Mbo I restriction enzyme (NEB, Ipswich, USA), fragment ends were blunted, repaired, biotin-labeled and ligated to form loops using T4 DNA ligase (Thermo Scientific, USA). After protein removal and ultrasonic disruption of the loops, the Hi-C library was sequenced on an MGISEQ platform. For the chromosome-level assembly, 69.21 Gb of Hi-C sequencing data were produced (Table 1) and used to cluster, order and orient contigs into 28 pseudochromosomes with the Juicer (v1.5) [33] and 3D-DNA (v180922) [34] pipelines. Scaffolding errors were then reviewed and curated using Juicebox (v1.11.08) [33].
The protein-coding genes that were derived only from ab initio prediction were filtered out. Overall, 20,400 protein-coding genes were obtained, with an average gene length of 8,921 bp and an average CDS length of 1,673 bp. The average exon number per gene was 9, with an average exon length of 177 bp and an average intron length of 858 bp (Table 7). The final gene models were then annotated using the NCBI nonredundant (NR) protein database (97.3%) and the SwissProt [48] (88.08%), KEGG [49] (85.63%), KOG [50] (76.36%), TrEMBL [49] (97.64%), InterPro [51] (90.93%) and Gene Ontology (GO) [52] (67.44%) databases. In total, 20,024 (98.16%) gene models had at least one homologous hit in these public databases. Of the functional proteins, 14,776 (~72.4%) were supported by all five of the InterPro, KEGG, NR, KOG and SwissProt databases (Fig. 5).
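The two summary counts above (genes annotated in at least one database, and genes supported by every database in a chosen subset) correspond to a set union and a set intersection over per-database hit lists. The sketch below illustrates this with toy data; the function name, the dictionary layout and the gene identifiers are hypothetical and do not reflect the actual annotation files.

```python
from typing import Dict, Set


def annotation_summary(hits: Dict[str, Set[str]], total_genes: int) -> Dict[str, float]:
    """Summarise functional annotation support across databases.

    hits maps a database name to the set of gene IDs with a hit in it.
    total_genes is the number of predicted gene models (the denominator).
    """
    if not hits or total_genes <= 0:
        raise ValueError("need at least one database and a positive gene count")
    gene_sets = list(hits.values())
    any_db = set().union(*gene_sets)                          # hit in >= 1 database
    all_db = set(gene_sets[0]).intersection(*gene_sets[1:])   # hit in every database
    return {
        "annotated_any": len(any_db),
        "annotated_all": len(all_db),
        "pct_any": 100.0 * len(any_db) / total_genes,
        "pct_all": 100.0 * len(all_db) / total_genes,
    }


# Toy data: five predicted genes (g1..g5) and three databases.
toy_hits = {
    "NR": {"g1", "g2", "g3", "g4"},
    "SwissProt": {"g1", "g2", "g3"},
    "KEGG": {"g1", "g2"},
}
summary = annotation_summary(toy_hits, total_genes=5)
print(summary["pct_any"], summary["pct_all"])  # prints 80.0 40.0
```

With the counts reported above and 20,400 predicted genes as the denominator, 20,024 / 20,400 ≈ 98.16% and 14,776 / 20,400 ≈ 72.4%, which matches the stated percentages.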

Data Records
All raw whole-genome data have been deposited in the National Center for Biotechnology Information (NCBI) SRA database (accessions SRR22936158 to SRR22936161) under BioProject accession number PRJNA915819 [53]. The Whole Genome Shotgun project has been deposited at GenBank under accession JARGSH000000000 [54].
The files for N. taihuensis gene structure annotation, gene functional annotation and repeat annotation have been deposited at Figshare [55].

Technical Validation
Evaluation of the genome assembly.
Assembly metrics for N. taihuensis were compared with those of other Salangidae species; the assembly in this study reached the gap-free, chromosome-scale level (Table 3). The contig N50 of our assembly was 15.61 Mb, while that of P. chinensis [56] was 103.01 kb and that of P. hyalocranius [57] was 17.74 kb. The contig number for our assembly was 137, while that of P. chinensis was 11,196 and that of P. hyalocranius was 19,755. These statistics indicate that our assembly is considerably more contiguous (Table 3).
Completeness was evaluated using BUSCO [58] analysis. BUSCO analysis revealed that 91.5% (single-copy genes: 90.0%, duplicated genes: 1.5%) of 3,640 single-copy orthologs (in the actinopterygii_odb10 database) were identified as complete, 1.8% were fragmented and 6.7% were missing in the assembly (BUSCO v5.1.0). Accuracy was evaluated by mapping the sequencing data to the assembled genome. The mapping rates were 94.63%, 99.8% and 100% for the DNBSEQ, PacBio and Nanopore data, respectively.

Evaluation of the gene annotation.
The completeness and accuracy of the gene structure annotation were evaluated using three different strategies. First, BUSCO analysis revealed that 90.9% (single-copy genes: 88.5%, duplicated genes: 2.4%) of 3,640 single-copy orthologs (in the actinopterygii_odb10 database) were identified as complete, while 1.6% were fragmented and 7.5% were missing (BUSCO v5.1.0) (Table 9). Second, to assess support from de novo prediction, homology-based prediction and transcript evidence, we calculated the CDS overlap between the final gene set and the predictions from these three methods. More than 99.78% of genes were supported by these predictions with a CDS overlap ratio greater than 80% (Table 10). Third, we compared the length distributions of genes, coding sequences (CDS), exons and introns with those of the D. rerio, O. latipes, P. hyalocranius and S. salar genomes and found similar distributions of these parameters (Fig. 6).
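The CDS overlap ratio used in the second validation step can be pictured as the fraction of a final gene model's coding bases that are also covered by the CDS intervals of a supporting prediction. The text does not specify exactly how the overlap was calculated, so the sketch below shows one plausible definition under stated assumptions: intervals are half-open `(start, end)` coordinates on the same strand, and the function names and toy coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def cds_overlap_ratio(model_cds: List[Interval], evidence_cds: List[Interval]) -> float:
    """Fraction of the model's CDS bases covered by the evidence CDS."""
    model = merge_intervals(model_cds)
    evidence = merge_intervals(evidence_cds)
    model_len = sum(end - start for start, end in model)
    if model_len == 0:
        raise ValueError("gene model has no CDS bases")
    covered = 0
    # Both lists are disjoint after merging, so pairwise overlaps never double count.
    for m_start, m_end in model:
        for e_start, e_end in evidence:
            covered += max(0, min(m_end, e_end) - max(m_start, e_start))
    return covered / model_len


# Toy gene model with two CDS exons (200 bp in total) and three evidence intervals.
model = [(100, 200), (300, 400)]
evidence = [(150, 200), (300, 380), (350, 420)]
ratio = cds_overlap_ratio(model, evidence)
print(ratio)           # prints 0.75
print(ratio > 0.80)    # prints False: this toy model would not pass an 80% threshold
```

Under this definition, a gene counts as supported when the ratio against at least one evidence source exceeds the chosen threshold; other definitions (for example, reciprocal overlap) would give different counts.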

Fig. 4 Circos plot of the N. taihuensis genome. The rings from inside to outside indicate (a) pseudochromosome length of the N. taihuensis genome, (b) gene frequency, (c) gene density, (d) TE density, and (e) GC density; b-d were drawn in 500-kb sliding windows.

Fig. 5 Venn diagram of the number of genes with homology or functional classification by each method. The Venn diagram shows the shared and unique annotations among InterPro, KEGG, KOG, NR and SwissProt.

Fig. 6 The composition of gene elements in the N. taihuensis genome compared to the genomes of other species. (a) mRNA length distribution and comparison with other species. (b) Exon length distribution and comparison with other species. (c) CDS length distribution and comparison with other species. (d) Intron length distribution and comparison with other species. (e) Exon number distribution and comparison with other species.

Table 1. Sequencing data used for the N. taihuensis genome assembly.

Table 2. Genome survey analysis information.

Table 3. Length and number statistics for the de novo assemblies of the N. taihuensis, P. chinensis and P. hyalocranius genomes.

Fig. 3 Characteristics of the N. taihuensis genome. Hi-C chromatin interaction map of the N. taihuensis assembly.

Table 4. Statistics of the chromosome-level assembly of the N. taihuensis genome.

Table 5. Statistics of repetitive sequences in the N. taihuensis genome.

Table 6. Genome information for four Actinopterygii species.

Table 10. Evidence supporting the gene models of the N. taihuensis genome.